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Abstract 

Propagating input uncertainty through non-linear 
Gaussian process (GP) mappings is intractable. 

This hinders the task of training GPs using un¬ 
certain and partially observed inputs. In this 
paper we refer to this task as “semi-described 
learning”. We then introduce a GP framework 
that solves both, the semi-described and the 
semi-supervised learning problems (where miss¬ 
ing values occur in the outputs). Auto-regressive 
state space simulation is also recognised as a spe¬ 
cial case of semi-described learning. To achieve 
our goal we develop variational methods for han¬ 
dling semi-described inputs in GPs, and couple 
them with algorithms that allow for imputing the 
missing values while treating the uncertainty in a 
principled, Bayesian manner. Extensive exper¬ 
iments on simulated and real-world data study 
the problems of iterative forecasting and regres¬ 
sion/classification with missing values. The re¬ 
sults suggest that the principled propagation of 
uncertainty stemming from our framework can 
significantly improve performance in these tasks. 

1 INTRODUCTION 

In many real-world applications missing values can occur 
in the data, for example when measurements come from 
unreliable sensors. Correctly accounting for the partially 
observed instances is important in order to exploit all avail¬ 
able information and increase the strength of the inference 
model. The focus of this paper is on Gaussian process (GP) 
models that allow for Bayesian, non-parametric inference. 

When the missing values occur in the outputs, the corre¬ 
sponding learning task is known as semi-supervised learn¬ 
ing. For example, consider the task of learning to classify 
images where the labelled set is much smaller than the to¬ 
tal set. Bootstrapping is a potential solution to this prob¬ 
lem [Rosenberg et ak, 2005], according to which a model 


trained on fully observed data imputes the missing outputs. 
Previous work in semi-supervised GP learning involved the 
cluster assumption [Lawrence and Jordan, 2005] for clas¬ 
sification. Here we consider an approach which uses the 
manifold assumption [Chapelle et ak, 2006; Kingma et ak, 
2014] which assumes that the observed, complex data are 
really generated by a compressed, less-noisy latent space. 

The other often encountered missing data problem has to 
do with unobserved input features (e.g. missing pixels in 
input images). In statistics, a popular approach is to im¬ 
pute missing inputs using a combination of different edu¬ 
cated guesses [Rubin, 2004]. In machine learning, Ghahra- 
mani and Jordan [1994] learn the joint density of the in¬ 
put and output data and integrate over the missing values. 
For Gaussian process models the missing input case has re¬ 
ceived only little attention, due to the challenge of prop¬ 
agating the input uncertainty through the non-linear GP 
mapping. In this paper we introduce the notion of semi- 
described learning to generalise this scenario. Specifically, 
we define semi-described learning to be the task of learning 
from inputs that can have missing or uncertain values. Our 
approach to dealing with missing inputs in semi-described 
GP learning is, algorithmically, closer to data imputation 
methods. However, in contrast to past approaches, the 
missing values are imputed in a fully probabilistic manner 
by considering explicit distributions in the input space. 

Our aim in this paper is to develop a general framework that 
solves the semi-supervised and semi-described GP learn¬ 
ing. We also consider the related forecasting regression 
problem, which is seen as a pipeline where predictions 
are obtained iteratively in an auto-regressive manner, while 
propagating the uncertainty across the predictive sequence, 
as in [Girard et ak, 2003; Quinonero-Candela et ak, 2003]. 
Here, we cast the auto-regressive GP learning as a partic¬ 
ular type of semi-described learning. We seek to solve all 
tasks within a single coherent framework that preserves the 
fully Bayesian property of the GP methodology. 

To achieve our goals we need three methodological tools. 
Firstly, we need approximations allowing us to consider 
and communicate uncertainty between the inputs and 


the outputs of the non-linear GP model. For this, we 
build on the variational approach of Titsias and Lawrence 
[2010] which allows for approximately propagating den¬ 
sities throughout the nodes of GP-based directed graphi¬ 
cal models. The resulting representation is particularly ad¬ 
vantageous, because the whole input domain is now coher¬ 
ently associated with posterior distributions. We can then 
sample from the input space in a principled manner so as 
to populate small initial labelled sets in semi-supervised 
learning scenarios. In that way, we avoid heuristic self¬ 
training methods [Rosenberg et al., 2005] that rely on boot¬ 
strapping and present problems due to over-confidence. 
Previously suggested approaches for modelling input un¬ 
certainty in GPs also lack the feature of considering an ex¬ 
plicit input distribution for both training and test instances. 
Specifically, [Girard et al., 2003; Quinonero-Candela et al., 
2003] consider the case of input uncertainty only at test 
time. Propagating the test input uncertainty through a 
non-linear GP results in a non-Gaussian predictive density, 
but Girard et al. [2003]; Quinonero-Candela et al. [2003]; 
Quinonero-Candela [2004] rely on moment matching to 
obtain the predictive mean and covariance. On the other 
hand, Oakley and O’Hagan [2002] do not derive analytic 
expressions but, rather, develop a scheme based on simu¬ 
lations. McHutchon and Rasmussen [2011] rely on local 
approximations inside the latent mapping function, rather 
than modelling the approximate posterior densities directly. 
Dallaire et al. [2009] do not propagate the uncertainty of 
the inputs all the way through the GP mapping but, rather, 
amend the kernel computations to account for the input un¬ 
certainty. [Quinonero-Candela and Roweis, 2003] can be 
seen as a special case of our developed framework, when 
the data imputation is performed using a standard GP-LVM 
[Lawrence, 2006]. Another advantage of our framework is 
that it allows us to consider different levels of input un¬ 
certainty per point and per dimension without, in princi¬ 
ple, increasing the danger of under/overfitting, since input 
uncertainty is modelled through a set of variational rather 
than model parameters. 

The second methodological tool needed to achieve our 
goals has to do with the need to incorporate partial or un¬ 
certain observations into the variational framework. For 
this, we develop a variational constraint mechanism which 
constrains the distribution of the input space given the ob¬ 
served noisy values. This approach is fast, and the whole 
framework can be incorporated into a parallel inference al¬ 
gorithm [Gal et al., 2014; Dai et al., 2014]. In contrast, 
Damianou et al. [2011] consider a separate process for 
modelling the input distribution. However, that approach 
cannot easily be extended for the data imputation purposes 
that concern us, since we cannot consider different uncer¬ 
tainty levels per input and per dimension and, additionally, 
computation scales cubicly with the number of datapoints, 
even within sparse GP frameworks. The constraints frame¬ 
work that we propose is interesting not only as an inference 


tool but also as a modelling approach; if the inputs are con¬ 
strained with the outputs, then we obtain the Bayesian ver¬ 
sion of the back-constraints framework of Lawrence and 
Quinonero Candela [2006] and Ek et al. [2008]. How¬ 
ever, in contrast to these approaches, the constraint defined 
here is a variational one, and operates upon a distribution, 
rather than single points. Zhu et al. [2012] also follow the 
idea of constraining the posterior distribution with rich side 
information, albeit for a completely different application. 
In contrast, Osborne and Roberts [2007] handle partially 
missing sensor inputs by modelling correlations in the in¬ 
put space through special covariance functions. 

Thirdly, the variational methods developed here need to be 
encapsulated into algorithms that perform data imputation 
while correctly accounting for the introduced uncertainty. 
We develop such algorithms after showing how the consid¬ 
ered applications can be cast as learning pipelines that rely 
on correct propagation of uncertainty between each stage. 

In summary, our contributions in this paper are the fol¬ 
lowing; firstly, by building on the Bayesian GP-LVM [Tit¬ 
sias and Lawrence, 2010] and developing a variational con¬ 
straint mechanism we demonstrate how uncertain GP in¬ 
puts can be explicitly represented as distributions in both 
training and test time. Secondly, we couple our varia¬ 
tional methodology with algorithms that allow us to solve 
problems associated with partial or uncertain observations: 
semi-supervised learning, auto-regressive iterative fore¬ 
casting and, finally, a newly studied type of GP learning 
which we refer to as “semi-described” learning. We solve 
these applications within a single framework, allowing 
for handling the uncertainty in semi-supervised and semi- 
described problems in a coherent way. The software ac¬ 
companying this paper can be found at; http://git.io/A3TN. 
This paper extends our previous workshop paper [Dami¬ 
anou and Lawrence, 2014]. 

2 UNCERTAIN INPUTS 

REFORMUUATION OF GP MODELS 

Assume a dataset of input-output pairs stored by rows in 
matrices X G and Y G 3?” respectively. Through¬ 
out this paper we will denote rows of the above matrices as 
{yi_:,Xi :} and columns (dimensions) as {yj,Xj}, while 
single elements (e.g. yij) will be denoted with a double 
subscript. We first outline the standard GP formulation, 
where all variables are fully observed. By assuming that 
outputs are corrupted by zero-mean Gaussian noise, de¬ 
noted by 6/, we obtain the following generative model: 

Vhj = + (e/)*j, (e/)*j ~ A/'(0,/3-^) . (1) 

We place GP priors on the mapping f, so that the function 
instantiations F = {fj}j^i follow a Gaussian distribution 
p{ij |X) = Af (fj |0, K), where K is the covariance matrix 
obtained by evaluating the GP covariance function kf on 



the inputs X. Therefore, the model likelihood p(Y|X) is: 

r ^ 

/ p(Y|F)p(F|X) = J]AA(y,|0,K+ /?-'!) • (2) 

“'F i=i 

In the other end of the spectrum is the GP-LVM [Lawrence, 
2006], where the inputs are fully unobserved (i.e. latent). 
This corresponds to the unsupervised GP setting. In the 
absence of observed inputs, the likelihood p(Y|X) takes 
the same form as in equation (2) but the inputs X now 
need to be recovered from the outputs Y through maxi¬ 
mum likelihood. The Bayesian GP-LVM proceeds by ad¬ 
ditionally placing a Gaussian prior on the latent space, 
p(X) = nr=i 1^’ ^)’ approximately integrating 

it out by constructing a variational lower bound IF, where 

< logp(Y) = log [ p{Y\X)p{X), 

Jx. 

and by introducing a variational distribution 

where . is a diagonal matrix, so that diag(Si.) € 
3?'?. We can derive an expression for this variational bound, 

T = (logp(Y|X)),(x) - KL (g(X) || p{X)), (3) 

where (•)g(x) denotes an expectation with respect to g(X). 
Since X appears non-linearly insidep(Y|X) (in the inverse 
of the covariance matrix K -|- /3~^T), the first term of the 
above variational bound is intractable. However, we can 
follow [Titsias and Lawrence, 2010] to approximate the in¬ 
tractable expectation analytically. 

In this paper we wish to define a general framework that 
operates in the whole range of the two aforementioned ex¬ 
trema, i.e. the fully observed and fully unobserved inputs 
case. The first step to obtaining such a framework is to 
allow for uncertainty in the inputs. We assume that the in¬ 
puts X are not observed directly but, rather, we only have 
access to their noisy versions = Z € The 

relationship between the noisy and true inputs is given by 
assuming Gaussian noise: 

Xi^: = Zj ; -f ~ (0? j (4) 

SO that p(X|Z) = A/'(xi.:|zi^:, Sa;). Obviously, 

when this distribution collapses to a delta function we re¬ 
cover the standard GP case, and when Z is not given we 
recover the GP-LVM. The problem with the modelling as¬ 
sumption of equation (4) is that now we cannot use equa¬ 
tion (1), since the inputs are not available. On the other 
hand, if we replace x^ ^ in that equation with z^;, then we 
effectively ignore the input noise. McHutchon and Ras¬ 
mussen [2011] proceed by combining equations (1) and (4) 
to obtain the GP mapping fj(xi ^ — (ex)i,-.) which is then 


treated using local approximations. However, our aim in 
this paper is to consider an explicit input distribution. One 
way to achieve this is to treat the unobserved true inputs 
as latent variables to be estimated from the marginal like¬ 
lihood p(Y|Z) = Jj^p(Y|X)p(X|Z). Following D ami- 
anou et al. [2011] we can obtain a variational lower bound 
F' < logp(Y|Z), with: 

F = (logp(Y|X))^(x) - KL (g(X) |j p(X|Z)). (5) 

This formulation corresponds to the graphical model of 
Figure 1(a). However, with this approach one needs to ad¬ 
ditionally estimate the noise parameters which might 
be challenging given their large number and their interplay 
with the variational noise parameters {S^ Therefore 

we considered an alternative solution which we found to 
result in better performance. 
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Figure 1: Incorporating uncertain inputs Z in GPs through 
an intermediate input space X by considering: (a) a Gaus¬ 
sian prior on X, centered on Z and (b) a variational con- 


straint (dashed line) on the approximate posterior. Figure 
(c) represents our two-stage approach to dealing with miss¬ 
ing outputs for classification, where the dotted line repre¬ 
sents a discriminative mapping. 

2.1 VARIATIONAL CONSTRAINT 

An alternative way of relating the true with the noisy in¬ 
puts can be obtained by focusing on the posterior rather 
than the prior distribution. To start with, we re-express the 
variational lower bound of equation (5) as: 

from where we break the logarithm to obtain: 

F = logp(Y|Z) - KL (g(X) || p(X|Y, Z)). 

We see that the lower bound becomes exact when the vari¬ 
ational distribution q{X) matches the true posterior distri¬ 
bution of the noise-free latent inputs given the observed in¬ 
puts and outputs. To allow for this approximation we in¬ 
troduce a simple variational constraint which operates on 
the factorised distribution, which is now written as g(X|Z) 
to highlight its dependency on Z. In the simplest case 





where all inputs are observed but uncertain, the constraint 
just consists of replacing the variational means fii - of each 
factor 9 (xi_:) with the corresponding observed input 
The variational parameters . then account for the uncer¬ 
tainty. Similarly to the back-constraint of Lawrence and 
Quinonero Candela [2006]; Ek et al. [2008], our varia¬ 
tional constraint does not constitute a probabilistic map¬ 
ping. However, it allows us to encode the input noise 
directly in the approximate posterior without having to 
specify additional noise parameters or sacrifice scalability. 
Next, we elaborate on the exact form of the constraint. 

In the general case, namely having inputs that are only par¬ 
tially observed, we can define a similar constraint which 
specifies a variational distribution as a mix of Gaussian and 
Dirac delta distributions. Notationally we consider data 
to be split into fully and partially observed subsets, e.g. 
Z = (Z°, Z"), where o and u denote fully and partially 
observed sets respectively. The features missing in Z“ can 
appear in different dimension(s) for each individual point 
z“., but for notational clarity u will index rows containing 
at least one missing dimension. In this case, the variational 
distribution is constrained to have the form 


g(X|Z,{o,i^})=g(X°|Z^)<7(X“|Z“) 


where e —> 0 , so that the corresponding distributions ap¬ 
proximate a Dirac delta. Notice that for a partially observed 
row z“., we can still replace an observed dimension j with 
its corresponding observation in the second set of factors 
of equation ( 6 ), i.e. = z’^j, so g(X“|Z«) 7 ^ g(X“). 
Given the above, as well as a spherical Gaussian prior for 
p(X), the required intractable density logp(Y|Z) is ap¬ 
proximated with a variational lower bound: 


J- = (logp(Y|X)),(x|z) - KL (<?(X|Z) II p(X)), (7) 


Then, the marginal p(Y|X) can be obtained from 
Jensen’s inequality after introducing a variational distribu¬ 
tion ( 7 (F, U), so that P < logp(Y|X), where: 


P = 



< 7 (F,U)log 


p(Y|F)p(F|U,X)p(U) 

<?(F,U) 


( 8 ) 


Now the fist term of equation (7) is approxmated as 
(p(Y|X))^(x|z) > (•F) 9 (x|z)- However, this approx¬ 
imation is still intractable, since the problematic term 
p(F|U,X) still appears inside P and contains X in the 
inverse of the covariance matrix, thus rendering the expec¬ 
tation intractable. The trick of Titsias and Lawrence [2010] 
is to define a variational distribution of the form: 


g(F,U)=p(F|U,X)g(U). (9) 


Replacing equation (9) inside the bound of equation ( 8 ) re¬ 
sults in the cancellation of p(F|U,X), leaving us with a 
tractable (partial) bound, which takes the form: 


(p(Y|X)),(x|z) > ^^(xiz) = -KL(g(U) ||p(U)) 


/x,u 


g(X|Z)g(U) / p(F|U,X)logp(Y|F) 


( 10 ) 


The augmentation trick decouples the latent function val¬ 
ues given the inducing points, so that any uncertainty in the 
inputs can be propagated through the nested integral. After 
this operation, the inducing outputs U can be marginalised 
out. Therefore, the above integral is analytically tractable, 
since the nested integral is tractable and results in a Gaus¬ 
sian where X no longer appears in the inverse of the covari¬ 
ance matrix. The final lower bound to use as an objective 
function is thus obtained by using the partial bound of eq. 
(10) in place of the first term of equation (7), thus obtaining 
a new, final bound (more details in the Appendix): 

^2 = (.F),(x|z) -KL(q(X|Z)||p(X)). (11) 


where for clarity we dropped the dependency on {o,u} 
from our expressions. Since the Dirac functions are ap¬ 
proximated with sharply peaked Gaussians inside the pos¬ 
terior ( 7 (X|Z), the above variational bound can be com¬ 
puted in the same manner as the Bayesian GP-LVM bound 
of equation (3). Specifically, the KL term is tractable, since 
it only involves Gaussians. 

As for the first term of equation (7), we follow the Bayesian 
GP-LVM methodology and we augment the probability 
space with m extra samples U = of the latent 

function / evaluated at a set of pseudo-inputs (known as 
“inducing points”) X„, so that U G and X„ G 

sjfjmxg £)^g jQ jjjg consistency of GPs, p(U|X„) is a Gaus¬ 
sian distribution. From now on we omit dependence on X„ 
from our expressions. The likelihood then becomes: 

p(Y,F,U|X) =p(Y|F)p(F|U,X)p(U). 


To summarise, the variational methodology seeks to ap¬ 
proximate the true posterior with a variational distribution 
g(F,U,X) = q(F)q{V)q(X). To achieve this, g(F) is 
constrained to take the exact form p(F|U, X). This term 
is then “eliminated”, giving us tractability, but its effect 
is re-introduced through the variational distribution (in the 
nested integral of eq. (10)). Contrast this with the varia¬ 
tional constraint on q(X.)'. that approximate posterior fac¬ 
tor is constrained according to Z, so that the effect of Z is 
considered only through the q(X|Z) (eq. (11)). The above 
comparison gives insight in the conceptual similarity of the 
variational approach followed to obtain tractability and the 
one followed for handling partially observed inputs. 

The variationally constrained model is shown in fig. 1(b). 
The total set of parameters to be optimised in the objective 
function F 2 of equation ( 11 ) (e.g. using a gradient-based 
optimiser) are the model parameters {Of, /3), where Of de¬ 
notes the hyper-parameters of the covariance function kf. 



and the variational parameters (X„, {/x"., (( 7 (U) 

can be optimally eliminated, see Appendix). Depending on 
the application and corresponding learning algorithm, cer¬ 
tain dimensions of {/x“., S“.} can be treated as observed. 
Such algorithms are discussed in the following sections. 

3 GP LEARNING WITH MISSING 
VALUES 

We formulate both the semi-described and semi-supervised 
learning as particular instances of learning a mapping 
function where the inputs are associated with uncertainty. 
In both cases, we devise a two-step strategy based on 
our uncertain inputs GP framework, which allows to ef¬ 
ficiently take into account the partial information in the 
given datasets to improve the predictive performance. For 
brevity, we refer to the framework described in the previ¬ 
ous section as a variationally constrained GP, from where 
a semi-described, an auto-regressive and a semi-supervised 
GP approach are obtained as special cases, given the algo¬ 
rithms that will be explained in this section. 

3.1 SEMI-DESCRIBED LEARNING 

We assume a set of observed outputs Y that correspond 
to fully observed inputs Z° and partially observed inputs 
Z“, so that Z = (Z°, Z"). To make the correspondence 
clearer, we also split the observed outputs according to the 
sets {o,u}, so that Y = (Y*^, Y“), but note that both out¬ 
put sets are fully observed. We are then interested in learn¬ 
ing a regression function from Z to Y by using all avail¬ 
able information. Since in the variationally constrained 
GP the inputs are replaced by distributions and 

( 7 (X"|Z"), the uncertainty over Z" can be taken into ac¬ 
count naturally through this variational distribution. In this 
context, we formulate a data imputation-based approach 
which is inspired by self-training methods; nevertheless, it 
is more principled in the handling of uncertainty. 

Specifically, the algorithm has two stages; in the first step, 
we use the fully observed data subset (Z°, Y°) to train an 
initial variationally constrained GP model which encapsu¬ 
lates the sharply peaked variational distribution ( 7 (X°|Z°) 
given in equation (6). Given this model, we can then use 
Y" to estimate the predictive posterior' q{lG‘\Z^) in the 
missing locations of Z“ (for the observed locations we 
match the mean with the observations in a sharply peaked 
marginal, as for Z°). Essentially, we replace the missing 
locations of the variational means /x“. and variances S" 
of ( 7 (X"|Z") with the predictive mean and variance ob¬ 
tained through the “self-training” step. This selection for 
S“} constitutes nevertheless only an initialisation. In 

'The predictive posterior for test data Y* is obtained by max¬ 
imising a variational lower bound similar to the training one (eq. 
(11)), but X and Y are now replaced with (X, X*) and (Y, Y*). 


the next step, these parameters are further optimised to¬ 
gether with the fully observed data. Specifically, after ini¬ 
tializing ( 7 (X|Z) = ( 3 '(X‘''’,X“|Z) as explained in step 1, 
we proceed to train a variationally constrained GP model 
on the full (extended) training set ((Z°, Z"), (Y''^, Y“)), 
which contains fully and partially observed inputs. 

Algorithm 1 outlines the approach in more detail. This for¬ 
mulation defines a semi-described GP approach which nat¬ 
urally incorporates fully and partially observed examples 
by communicating the uncertainty throughout the relevant 
parts of the model in a principled way. Indeed, the predic¬ 
tive uncertainty obtained when imputing missing values in 
the first step of the pipeline is incorporated as input uncer¬ 
tainty in the second step of the pipeline. In extreme cases 
resulting in very non-confident predictions, for example in 
the presence of outliers, the corresponding locations will 
simply be ignored automatically due to the large uncer¬ 
tainty. This mechanism, together with the subsequent opti¬ 
misation of the parameters of ( 7 (X“|Z“) in stage 2, guards 
against reinforcing bad predictions when imputing missing 
values based on a smaller training set. The model includes 
GP regression and the GP-LVM as special cases. In par¬ 
ticular, in the limit of having no observed values our semi- 
described GP is equivalent to the GP-LVM and when there 
are no missing values it is equivalent to GP regression. 

There are some similarities to traditional self-training 
[Rosenberg et al., 2005], but as there are no straightforward 
mechanisms to propagate uncertainty in that domain, they 
typically rely on boot-strapping followed by thresholding 
“bad” samples to prevent model over-confidence. In our 
framework, the predictions made by the initial model only 
constitute initialisations which are later optimised along 
with model parameters and, hence, we refer to this step as 
partial self-training. Further, the predictive uncertainty is 
not used as a hard measure of discarding unconfident pre¬ 
dictions; instead, we allow all values to contribute accord¬ 
ing to an optimised uncertainty measure, that is, the input 
variances S^. Therefore, the way in which uncertainty is 
handled makes the self-training part of our algorithm prin¬ 
cipled compared to many bootstrap-based approaches. 

DEMONSTRATION 

We considered simulated and real-world data to demon¬ 
strate our semi-described GP algorithm. The simulated 
data were created by sampling inputs Z from a GP (which 
was unknown to the competing models) and then giving 
these samples as input to another unknown GP, to obtain 
corresponding outputs Y. For the real-world data demon¬ 
stration we considered a motion capture dataset taken from 
subject 35 in the CMU motion capture database. We se¬ 
lected a subset of walk and run motions of a human body 
represented as a set of 59 joint locations. We formulated 
a regression problem where the first 20 dimensions of the 
original data are used as targets and the remaining 39 as 



Algorithm 1 Semi-described learning with uncertain input GPs. 

1: Given: Fully and partially observed inputs, 7/^ and Z“ respectively, corresponding to fully observed outputs Y° and Y“. 


Construct g(X‘^|Z‘^) = (x^:|z°:,el) , where: e —>• 0 

Fix g(X°|Z°) in the optimiser 

Train a variationally constrained GP model with inputs g(X‘^|Z‘^) and outputs Y*^ 

for i = 1, • • • , |Y“| do 

Predict the distrihution A/” from the approximate posterior of model . 


# (i.e. (j'(X‘^|Z‘^) has no free parameters) 


Initialise parameters ofg(x“.|z".) = Af (x",|/i“., S") as follows: 

for j = 1, • • • , g do 


if Zij is observed then 


= zYj and = e, where: e 

Fix in the optimiser 

else 

and (Sn,., = (Sn.,. 


# (i.e. they don’t constitute parameters) 


14: Train model Ad®’" with inputs and outputs (Y‘^,Y“). The input distribution g(X<‘^’">|Z^°’">) = 

g(X‘^|Z‘^)g(X“|Z“) is constructed in steps 2, 5-13 and further optimised in the non-fixed locations. 

15: Model Ad*^’” now constitutes the semi-described GP and can be used for all required prediction tasks. 


inputs. That is, given a partial joint representation of the 
human body, the task is to infer the rest of the represen¬ 
tation. For both datasets, simulated and motion capture, 
we selected a portion of the training inputs, denoted as Z", 
to have randomly missing features. The extended dataset 
((Z°, Z"), (Y*^, Y“)) was used to train; a) our method, 
referred to as semi-described GP (SD-GP) b) multiple lin¬ 
ear regression (MLR) c) regression by performing nearest 
neighbour (NN) search between the test and training in¬ 
stances, in the observed input locations d) performing data 
imputation using the standard GP-LVM. Not taking into 
account the predictive uncertainty during imputation was 
found to have catastrophic results in the simulated data, 
as the training set was not robust against bad predictions. 
Therefore, the “GP-LVM” variant was not considered in 
the real data experiment. We also considered; e) a standard 
GP which cannot handle missing inputs straightforwardly 
and so was trained only on the observed data (Z'^, Y*^). 
The goal was to reconstruct test outputs Y* given fully ob¬ 
served test inputs Z*. For the simulated data we used the 
following sizes: |Z°| = 40, |Z“| = 60 and |Z*| = 100. 
The dimensionality of the inputs is q = 15 and of the 
outputs is p = 5. For the motion capture data we used 
|Z°| = 50, |Z"| = 80 and |Z*| = 200. In fig. 2 we plot the 
MSB obtained by the competing methods for a varying per¬ 
centage offv missing features in Z". For the simulated data 
experiment, each of the points in the plot is an average of 4 
runs which considered different random seeds. For clarity, 
the p—axis limit is fixed in figure 2, because some methods 
produced huge errors. The full picture is in figure 5 (Ap¬ 
pendix). As can be seen in the figures, the semi-described 
GP is able to handle the extra data and make much better 
predictions, even if a very large portion is missing. Indeed, 
its performance starts to converge to that of a standard GP 
when there are 90% missing values in Z“ and performs 


identically to the standard GP when 100% of the values are 
missing. We found that when q is large compared to p and 
n, then the data imputation step can be problematic as the 
percentage of missing features in Z" approaches 100% i.e. 
the method is reliant on having some covariates available. 
Appendix D discusses this behaviour, but a more system¬ 
atic investigation is left as future work. 

3.2 AUTO REGRESSIVE GAUSSIAN PROCESSES 

Having a method which implicitly models the uncertainty 
in the inputs of a GP also allows for doing predictions 
in an autoregressive manner [Oakley and O’Hagan, 2002] 
while propagating the uncertainty through the predictive 
sequence [Girard et al., 2003; Quinonero-Candela et al., 
2003]. Specifically, assuming that the given data Y consti¬ 
tute a multivariate timeseries where the observed time vec¬ 
tor t is equally spaced, and given a time-window of length 
T, we can reformat Y into input-output collections of pairs 
Z and Y as follows: the first input to the model, Zi_:, will 
be given by the stacked vector [yi,:, ...,yr,:] and the first 
output, yi will be given by yr-i-i,: similarly for the 

other data in Z and Y, so that: 

[^ 1 ,:, 22 ^:, = 

[[yi,:, y2,:, ■•■Yr,:], [y2,:, ys,:, yr-Tl,:] , •■•] , 

[yi,:,y2,:, ■•■Yn-r,:] = [yr-Tl,:, yr-r2.:, • , yn.:] ■ 

To perform extrapolation we first train the model on 
the modified dataset (Z,Y). By referring to the semi- 
described formulation described in Section 3.1, we assign 
all training inputs to the observed set o. After training, we 
can perform iterative prediction to find a future sequence 
Z» := [y„+i_:, y„+ 2 ,:,...] where, similarly to the approach 
taken by Girard et al. [2003], the predictive variance in each 






Figure 2: MSE for predictions obtained by different methods on semi-described learning. GP cannot handle partial ob¬ 
servations, thus the uncertainty {2a) is constant; for clarity, the errorbar is plotted separately on the right of the dashed 
vertical line (for nonsensical x values). The results for simulated data are obtained from 4 trials. For clarity, the limits on 
the y—axis are hxed, so when the errors become too big for certain methods they get off the chart. The errorbars for the 
GPLVM-based approach are also too large and not plotted. The full picture is given in figure 5 (Appendix). 


step is accounted for and propagated in the subsequent pre¬ 
dictions. The algorithm makes iterative l-step predictions 
in the future; initially, the output Zi * := y„+i.: will be pre¬ 
dicted (given the training set) with predictive variance S*;i. 
In the next step, the “observations” set will be augmented 
to include the distribution of predictions over y„+i,:, by 

dehning ^(xn+i^.|zi_*) = A/" Jz*_i, S*.i^, and so 

on. This simulation process can be seen as constructing a 
predictive sequence step by step, i.e. the newly inserted in¬ 
put points constitute parts of the (test) predictive sequence 
and not training points. Therefore, this procedure can be 
seen as an iterative version of semi-described learning. 

Note that it is straightforward to extend this model by ap¬ 
plying this auto-regressive mechanism in a latent space of 
a stacked model or, more generally, as a deep GP [Dami- 
anou and Lawrence, 2013]. By additionally introducing 
functions that map from this latent space nonlinearly to an 
observation space, we obtain a fully nonlinear state space 
model in the manner of Deisenroth et al. [2012]. For our 
model, uncertainty is encoded in both the states and the 
nonlinear transition functions. Correct propagation of un¬ 
certainty is vital in well calibrated models of future system 
behavior, and automatic determination of the structure of 
the model (e.g. the window size) can be informative in de¬ 
scribing the order of the underlying dynamical system. 

DEMONSTRATION: ITERATIVE FORECASTING 

Here we demonstrate our framework in the simulation of a 
state space model. We consider the Mackey-Glass chaotic 
time series, a standard benchmark which was also consid¬ 
ered by Girard et al. [2003]. The data is one-dimensional 
so that the timeseries can be represented as pairs of values 
{y, t}, f = 1, 2, • • • , n and simulates the process: 

^ = -bCit) + a i+g(~P)UM (a,6,T) = (0.2,0.1,17). 


Obviously the generating process is very non-linear, ren¬ 
dering this dataset challenging. We trained the autoregres¬ 
sive model on data from this series, where the modihed 
dataset {z,y} was created with r = 18 and we used the 
first 4t = 72 points to train the model and predicted the 
subsequent 1110 points through iterative free simulation. 

We compared our method with a “naive autoregressive” GP 
model where the input-output pairs were given by the au¬ 
toregressive modihcation of the dataset {z,y}. For that 
model, the predictions are made iteratively and the pre¬ 
dicted values after each predictive step are added to the 
“observation” set. However, this standard GP model has 
no straight forward way of incorporating/propagating the 
uncertainty and, therefore, the input uncertainty is zero for 
every step of the iterative predictions. We also compared 
against the method of Girard et al. [2003]^, denoted in the 
plots as “GPuncert”- Figure 3 shows the results for the last 
310 steps (i.e. t = 800 onwards) of the full free simula¬ 
tion (1110—step ahead forecasting); hgure 6 (Appendix) 
gives a more complete picture. As can be seen in the vari¬ 
ances plot, both our method and GPuncert are more robust 
in handling the uncertainty throughout the predictions; the 
“naive” GP method underestimates the uncertainty. Conse¬ 
quently, as can be seen in hgure 6, in the hrst few predic¬ 
tions all methods give the same answer. However, once the 
predictions of the “naive” method diverge a little by the true 
values, the error is carried on and amplihed due to under¬ 
estimating the uncertainty. On the other hand, GPuncert per¬ 
haps overestimates the uncertainty and, therefore, is more 
conservative in its predictions, resulting in higher errors. 
Quantihcation of the error is shown in Table 1 (Appendix). 


^We implemented the basic moment matching approach, al¬ 
though in the original paper the authors use additional approxima¬ 
tions, namely Taylor expansion around the predictive moments. 

























Figure 3: Chaotic timeseries: forecasting 1110 steps ahead by iterative prediction. The first 800 steps are not shown here, 
but figure 6 (Appendix) gives the complete picture. Comparing: a “naive autoregressive” GP which does not propagate 
(and hence underestimates) the uncertainties; the method of Girard et al. [2003], referred to as GPuncert; and our approach, 
which closely tracks the true test sequence until the last steps of the extrapolation. The comparative depiction of the 
predictions is split into two plots (for clarity), left and center. The rightmost plot shows the predictive uncertainties {2a). 
x—axis is the prediction step (t) and y—axis is the function value, f{t). 


3.3 SEMI-SUPERVISED LEARNING 

In this section we study semi-supervised learning which, in 
contrast to semi-described learning, is for handling missing 
values in the outputs. This scenario is typically encoun¬ 
tered in classification settings. Therefore, we introduce the 
sets {c,m] that index respectively the labelled and miss¬ 
ing (unlabelled) rows of the outputs (labels) Y. Accord¬ 
ingly, the full dataset is split so that Z = (Z^, Z-^) and 
Y = (Y^, Y^), where Z is now fully observed. The task 
is then to devise a method that improves classification per¬ 
formance by using both labelled and unlabelled data. 

Inspired by Kingma et al. [2014] we define a semi- 
supervised GP framework where features are extracted 
from all available information and, subsequently, are given 
as inputs to a discriminative classifier. Specifically, using 
the whole input space Z, we learn a low-dimensional la¬ 
tent space X through an approximate posterior ( 7 (X) « 
p(X|Z). Obviously, this specific case where the input 
space is uncertain but totally unobserved (i.e. a latent 
space) just reduces to the Bayesian GP-LVM model. No¬ 
tice that the posterior (/(X) is no longer constrained with 
Z but, rather, directly approximates p(X|Z), since we now 
have a forward probabilistic mapping from X to Z and Z is 
treated as a random variable with p(Z|X) being a Gaussian 
distribution, i.e. exactly the same setting used in the GP- 
LVM. Since there is one-to-one correspondence between 
X, Z and Y, we can notationally write X = (X^,X-^). 
Further, since we assume that y(X) is factorised across dat- 
apoints, we can write gjX) = q{'K^)q{'K-^). 

In the second step of our semi-supervised algorithm, we 
train a discriminative classifier from g(X^) to the observed 
labelled space, Y^. The main idea is that, by including 
the inputs Z^ in the first learning step, we manage to de¬ 


fine a better latent embedding from which we can extract 
a more useful set of features for the discriminative clas¬ 
sifier. Notice that what we would ideally use as input to 
the discriminative classifier is a whole distribution, rather 
than single point estimates. Therefore, we wish to take ad¬ 
vantage of the associated uncertainty; specifically, we can 
populate the labelled set by sampling from the distribution 
y(X^). For example, if a latent point xf. corresponds to 
the input-output pair (zf., yf.), then a sample from y(xf.) 
will be assigned the label yf.. 

The two inference steps described above are graphically 
depicted in Figure Ic. This is exactly the same setting 
suggested by Kingma et al. [2014], but here we wish to 
investigate its applicability in a non-parametric, Gaussian 
process based framework. The very encouraging results 
reported below point towards the future direction of apply¬ 
ing this technique in the framework of deep Gaussian pro¬ 
cesses [Damianou and Lawrence, 2013], so as to be able 
to compare to [Kingma et al., 2014] who considered deep, 
generative (but nevertheless parametric) models. 

DEMONSTRATION 

We evaluated our semi-supervised GP algorithm in two 
datasets; firstly, we considered 2000 examples from the 
USPS handwritten digit database [Hull, 1994]. These ex¬ 
amples contained the digits {0, 2,4, 6} and were split so 
that 800 instances were used as a test set. From the re¬ 
maining 1200 instances, we selected various portions to be 
labelled and the rest to be unlabelled. The experiment was 
repeated 8 times (each time involving different subsets due 
to different random seeds), so that we can include error- 
bars in our plots. Secondly, we considered the oil flow data 
[Bishop and James, 1993] that consist of 1000, 12 dimen- 













# Observed # Observed 

Figure 4: Plots of the number of incorrectly classified test points as a function of |Z^|. Multiple trials were performed, 
but the resulting errorbars are shown at one standard deviation. In small training sets large errorbars are expected because, 
occasionally, very challenging instances/outliers can be included and result in high error rates (for all methods) that affect 
the overall standard deviation. The Bayesian GP-LVM baseline struggled with small training sets and performed very 
badly in the oil dataset; thus, it is not plotted for clarity. 

sional observations belonging to three known classes cor¬ 
responding to different phases of oil flow. In each of the 
10 performed trials, 700 instances were used as a test set 
whereas the rest were split to different proportions of la¬ 
belled/unlabelled sets. Multi-label data can also be handled 
by our method, but this case was not considered here. 

Our method learned a low-dimensional embedding (/(X) 
from all available inputs, and a logistic regression classifier 
was then trained from the relevant parts of the embedding 
to the corresponding class space. We experimented with 
taking different numbers of samples from q(X^) for popu¬ 
lating the initial labelled set; the difference after increasing 
over 6 samples was minimal. Also, when using only the 
mean of (as opposed to using multiple samples) we 

obtained worse results (especially in the digits data), but 
this method still outperformed the baselines. We compared 
with training the classifier on features learned by (a) a stan¬ 
dard Bayesian GP-LVM and (b) PCA, both applied in 
Both of the baselines do not take Z^ into account, nor do 
they populate small training sets using sampling. Figure 4 
presents results suggesting that our approach manages to 
effectively take into account unlabelled data. The gain in 
performance is significant, and our method copes very well 
even when labelled data is extremely scarce. Notice that all 
methods would perform better if a more robust classifier 
was used, but logistic regression was a convenient choice 
for performing multiple trials fast. Therefore, our conclu¬ 
sions can be safely drawn from the obtained relative errors, 
since all methods were compared on equal footing. 

4 DISCUSSION AND FUTURE WORK 

We have defined semi-described learning as the scenario 
where missing and uncertain values occur in the inputs. We 


considered semi-described problems to be part of a general 
class of missing value problems that also includes semi- 
supervised learning and auto-regressive future state sim¬ 
ulation. A principled method for including input uncer¬ 
tainty and partial inputs in Gaussian process models was 
also introduced to solve these problems within a single, co¬ 
herent framework. We explicitly represent this uncertainty 
as approximate posterior distributions which are variation- 
ally constrained. This allowed us to further define algo¬ 
rithms for casting the missing value problems as particular 
instances of learning pipelines which use our variationally 
constrained GP formulation as a building block. Our algo¬ 
rithms resulted in significant performance improvement in 
forecasting, regression and classification. We believe that 
our contribution paves the way for building powerful mod¬ 
els for representation learning from real-world, heteroge¬ 
nous data. In particular, this can be achieved by combin¬ 
ing our method with deep Gaussian process models [Dami- 
anou and Lawrence, 2013] that use relevance determination 
techniques [Damianou et ah, 2012], so as to consolidate 
semi-described hierarchies of features that are gradually 
abstracted to concepts. We plan to investigate the appli¬ 
cation of these models in settings where control [Deisen- 
roth et ah, 2014] or robotic systems learn by simulating 
future states in an auto-regressive manner and by using in¬ 
complete data with miminal human intervention. Transfer 
learning is another promising direction for applying these 
models. 
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A APPENDIX: VARIATIONAL LOWER 
BOUND 

In this appendix we give some more details on the compu¬ 
tation of the variational lower bound for the variationally 
constrained model. 

The augmented joint probability density (after introducing 
the inducing points) takes the form, 

p(Y,F,U,X|X„) 

= p(Y|F)p(F|U, X, X„)p(U|X„)p(X) 

nP(y.|f^)p(f,>,,X,X„)p(u,|X„)j p(X). 

In the r.h.s above, the observed inputs Z do not appear, ex¬ 
actly because we introduce them through the variational 
constraint, which does not constitute a probabilistic map¬ 
ping. In the above equations we have 

p(f,-|uj,X,X„) =A/'(fj|aj,S/), 

being the conditional GP prior with 

3.j — and — K KfuKmjKuf 

and 

p(uj|X„) = A/'(uj|0,Kuu), 

is the marginal GP prior over the inducing variables. In 
the above expressions, Kuu denotes the covariance matrix 
constructed by evaluating the covariance function on the 
inducing points, Kuf is the cross-covariance between the 
inducing and the latent points and Kfu = Kuf ■ 

In order to perform variational inference in this expanded 
probability model, we introduce the variational distribu¬ 
tions ( 7 (X|Z) and (/(U), which are both taken to be Gaus¬ 
sian. For convenience, we drop the inducing points X^ 
from our expressions for the remainder of the Appendix, 
for convenience. We now have: 

logp(Y|X„) = 

log [ p(U)p(X) [ p(Y|F)p(F|U,X). 
Ju,x Jf 

By applying Jensen’s inequality, we obtain a lower bound 
J^(( 7 (X), (/(U)) on the above marginal likelihood, where: 


At this point, our variational bound is similar to the one of 
equation (7), but the first term, here denoted as T, refers 
to the expanded probability space and, thus, involves the 
inducing inputs and the additional variational distribution 
( 7 (U). Since the second term (the KL term) is tractable (be¬ 
cause it only involves Gaussian distributions), we are now 
going to focus on the F term. By breaking the logarithm 
again, we can further write this term as: 

-F = X 9(U)7(X|Z) log p(Y|F)p(F|U, X)^ 
-KL(g(U)||p(U)) (A.l). 

We notice that we can make use of Jensen’s inequality once 
more, because: 

log (^^p(Y|F)p(F|U,X)^ > ^p(F|U,X)logp(Y|F). 

This expectation is analytically tractable. Indeed, for a sin¬ 
gle dimension j, we can find this expectation as: 

[ P(fj|uj>X)logp(yj|f,) = 

Jf, 

logAA(y,|a„r'l)-ftr(K) 

+ ^tr(K-^KufKfu), 

where K is the covariance matrix constructed by evaluating 
the covariance function on the training inputs X. The full 
expression can be found by taking the appropriate product 
with respect to dimensions; indeed, since the joint proba¬ 
bility factorises with respect to output dimensions j, then 
a bound to the logarithm of the marginal likelihood can be 
written as a sum over terms, where every term considers 
a single dimension j. Notice that to obtain this tractable 
bound we did not explicitly make the assumption of equa¬ 
tion (9) about the form of the variational distribution. How¬ 
ever, this assumption is still made implicitly and the equiv¬ 
alence of the two derivations is rather instructive with re¬ 
spect to the effect of a variational constraint. 

We also notice that in the above expression, the covariance 
matrix K is no longer inverted. Therefore, by writting the 
term in this form, we manage to obtain an expression 
which allows the uncertainty in X to be propagated through 
the GP mapping. 


J-(g(X|Z),9(U)) = 

[ q{V)q{X\Z)\og 

JU,X 

= [ <?(U)g(X|Z)log 
-KL(g(X|Z)|b(X)) 


p(U)p(X)/pp(Y|F)p(F|U,X) 

q{U)q{X\Z) 

p(U)/pP(Y|F)p(F|U,X) 

9(U) 


:=.F-KL(g(X|Z)||p(X)). 


It is possible to also obtain a “tighter” variational bound 
J^(( 7 (X|Z)) > J^(( 7 (U),g(X|Z)) which does not depend 
on (?(U). To do so, we need to “collect” all terms that 
contain p(U) from equation (A.l) and find the stationary 
point with respect to the distribution g(U) (by computing 
the gradient w.r.t g(U) and setting it to zero). By doing 
so, we are then able to replace q{XJ) with its optimal value 
back to the variational bound. Titsias and Lawrence [2010] 
further explain this trick. 



Toy data 


Motion capture data 



Figure 5; MSE for predictions obtained by different methods on semi-described learning (full version of figure 2). Compar¬ 
ing our method (SD-GP), the standard GP method, multiple linear regression (MLR), nearest neighbour regression on the 
input space (NN), the data-imputation method based on GP-LVM and the mean predictor (mean). The results for simulated 
data are obtained from 4 trials. The GP method cannot handle partial observations, thus the uncertainty (2cr) is constant; 
for clarity, the errorbar is plotted separately on the right of the dashed vertical line (for nonsensical x values). The GP-LVM 
method produced huge errorbars (about 3.5 times larger than thos of MLR), thus we don’t plot them here, for clarity. 


B APPENDIX: MORE DETAILS FOR 
THE SEMI-DESCRIBED LEARNING 
EXPERIMENT 


In Section 3.1 we looked at performing predictions with 
Gaussian processes trained from partially observed inputs. 
Our method (semi-described GP or SD-GP) was compared 
to other approaches in figure 2, but the limit in the y—axis 
was fixed to a smaller value to show the comparison with 
the standard GP method more clearly. Lor the same reason, 
methods which produced very large errors were omitted. In 
this appendix we show the full figure from all the obtained 
results - hgure 5. 

The conclusion drawn from hgure 5 is that our method is 
very efficient in taking into account the extra, partially ob¬ 
served input set Z“. This is true even if this extra set only 
has a small proportion of features observed. On the other 
hand, nearest neighbour runs into difficulties when real data 
are considered and, even worse, produces huge errors when 
more than 60% of the features are missing in Z“. Linally, 
the baseline which uses the standard GP-LVM as a means 
of imputing missing values produces bad results, in fact 
worse compared to if the extra set Z" is just ignored (i.e. 
the GP baseline). This is because the baseline using GP- 
LVM treats the input space as single point estimates; by 
not incorporating (and optimising jointly) the uncertainty 
for each input location, the model has no way of ignoring 
“bad” imputed values. 


C APPENDIX: MORE DETAILS FOR 
THE AUTO-REGRESSIVE 
EXPERIMENT 

This appendix refers to the auto-regressive Gaussian pro¬ 
cess model developed in Section 3.2. In hgure 3 we showed 
the results from the last 310 steps of the iterative forecast¬ 
ing task. Here (hgure 6) we show the rest of the predictive 
sequence, obtained for extrapolating up until 1110 steps. 
The corresponding quantihcation of the error is shown in 
Table 1. 

Table 1; Mean squared and mean absolute error obtained 
when extrapolating in the chaotic time-series data. GPuncen 
refers to the basic (moment matching) method of Girard 
et al. [2003] and the “naive” autoregressive GP approach is 
the one which does not propagate uncertainties. 


Method 

MAL 

MSL 

ours 

0.529 

0.550 

GPuncert 

0.700 

0.914 

“naive” GP approach 

0.799 

1.157 


D APPENDIX: THE EFFECT OF g, p, n IN 
SEMI-DESCRIBED LEARNING 

As mentioned in Section 3.1, we found that when q is large 
compared to p and n, then the data imputation step of our 










































algorithm can be problematic as the percentage of miss¬ 
ing features in approaches 100%. This is somehow 
a corner-case, but it still shows that the method is reliant 
on having some covariates available. To investigate further 
this issue we created simulated data as explained in Section 
3.1, but this time multiple datasets were generated with dif¬ 
ferent input and output dimensions, q and p respectively. In 
hgure 7 we show the comparison of SD-GP and the stan¬ 
dard GP (which ignores Z") for different selections of q, p 
and percentage of missing features in Z". 


q=5, 50% missing q=15, 50% missing 



Z“. However, when the percentage of missing fea¬ 
tures is very large and the relative size of p and n is 
small compared to q, then the method can produce 
worst results compared to the standard GP. 

To explain the challenge of handling missing values with 
SD-GP, consider that a separate variational parameter ex¬ 
ists for every input, namely the parameters , 5'“^, i = 
1, = 1,..., q of step 7 in Algorithm 1. In the extreme 

cases mentioned in the previous paragraph, the number of 
variational parameters remains large but the available co¬ 
variates to learn from are too few. This renders the optimi¬ 
sation of the parameters very difficult. 


q=5, 90% missing q=15, 90% missing 



q=5, 100% missing q=15, 100% missing 



Figure 7; Comparison of our method (SD-GP) and the stan¬ 
dard GP (which ignores Z") for different selections of q, p 
and percentage of missing features in Z“. 

The summary of this experiment is that; 

• For the most usual scenarios, i.e. when the percentage 
of features missing is not too high, SD-GP performs 
very well, but as p and n become small compared to 
q, then the performance of the method seems to dete¬ 
riorate. 

• Even if 100% of the features are missing in Z“, us¬ 
ing our SD-GP can still be advantageous compared 
to using a standard GP. This is because SD-GP can 
utilise the extra information in the fully observed out¬ 
puts, Y“, which correspond to the fully missing set 
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Figure 6; The full predictions obtained by the competing methods for the chaotic time-series data. The top 3 plots show the 
values obtained in each predictive step for each of the compared methods; the plot on the bottom shows the corresponding 
predictive uncertainties (2cr). GPuncert refers to the basic (moment matching) method of Girard et al. [2003] and the GP is 
the “naive” autoregressive GP which does not propagate uncertainties. 




































































